In this part of this study, for initial modeling and analysis, I will be looking at the total number of thefts from January 2001 - March 11, 2023 that were reported. Note that due to the large size of the original dataset (nearly 8 million rows), the raw data is not included in this repository. The raw data can be accessed here.
For this partition of the data, there are two variables: year/month and the number of thefts reported in each month. The full dataset has more variables, which are described below. Each row in the full dataset represents an individual crime that was reported.
| ID | Unique identifier for the record. |
|---|---|
| Case number | Chicago id for the case number |
| Date | Date when the incident occurred, this is sometimes a best estimate. |
| Block | The partially redacted address where the incident occurred. |
| IUCR | The Illinois Uniform Crime Reporting Code. |
| Primary Type | The primary description of the IUCR code. |
| Description | The secondary description of the IUCR code. |
| Location description | The primary description of the location where the incident occurred. |
| Arrest | Whether or not the incident resulted in an arrest. |
| Domestic | Whether or not the incident was a domestic incident. |
| Beat | Indicates the beat where the incident occurred. |
| District | The police district where the incident occurred. |
| Ward | The city council district where the incident occurred. |
| Community Area | The community area where the incident occurred. |
| FBI Code | FBI Code crime classification. |
| X Coordinate | The X coordinate location where the incident occurred. |
| Y coordinate | The Y coordinate where the incident occurred. |
| Year | Year the incident occurred. |
| Updated on | Date and time the record was last updated. |
| Latitude | The latitude where the incident occurred. |
| Longitude | The longitude where the incident occurred. |
| Location | The location of the incident. |
The population data for this project was extracted using the US Census API ACS, which provided population data from 2005-2019 from Cook County. Since Cook County is the base of the Chicago metropolitan area, I chose to extract this data, and include a variable that scaled the number of thefts over time by the yearly population. So far, I haven’t been able to track down monthly population data for the Chicago area. Note that also the data from 2001-2005 is harder to access as it is only available through the population estimates API. Therefore, although the original dataset goes from 2001-2023, at this time, I am focusing on 2005-2019. The population from the Chicago website is limited to 2018-2022.
thefts <- read_csv("data/thefts_per_capita.csv") %>%
select(-year)
## Rows: 180 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Month
## dbl (4): NumThefts, year, population, TheftsPer1000
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chicago_population <- read_csv("data/cook_county_estimates_2005-2019.csv") %>%
mutate(year = year + 2004) %>%
pivot_wider(names_from = variable, values_from = estimate)
## Rows: 30 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): NAME, variable
## dbl (3): year, GEOID, estimate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(thefts)
## # A tibble: 6 × 4
## Month NumThefts population TheftsPer1000
## <chr> <dbl> <dbl> <dbl>
## 1 2005 Jan 6114 5206357 1.17
## 2 2005 Feb 5592 5206357 1.07
## 3 2005 Mar 6570 5206357 1.26
## 4 2005 Apr 7029 5206357 1.35
## 5 2005 May 7276 5206357 1.40
## 6 2005 Jun 7408 5206357 1.42
head(chicago_population)
## # A tibble: 6 × 5
## year GEOID NAME median_income total_pop
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2005 17031 Cook County, Illinois 48950 5206357
## 2 2006 17031 Cook County, Illinois 50691 5288655
## 3 2007 17031 Cook County, Illinois 52564 5285107
## 4 2008 17031 Cook County, Illinois 54582 5294664
## 5 2009 17031 Cook County, Illinois 52539 5287037
## 6 2010 17031 Cook County, Illinois 51466 5200950
thefts_ts <- thefts %>%
mutate(Month = yearmonth(Month)) %>%
as_tsibble(index = Month)
chicago_population_ts <- chicago_population %>%
as_tsibble(key = c(GEOID, NAME), index = year)
chicago_crime <- read.csv("data/thefts_by_month.csv")
chicago_crime <- chicago_crime %>%
select(-X) %>%
rename(NumThefts = sum.Count.) %>%
drop_na(month)
head(chicago_crime)
## year month NumThefts
## 1 2022 10 5224
## 2 2015 2 3228
## 3 2019 10 5390
## 4 2001 1 7867
## 5 2017 3 4493
## 6 2008 8 8501
chicago_crime_monthly <- chicago_crime %>%
mutate(month = month.name[month]) %>%
mutate(Month = str_c(year, month, sep = " ")) %>%
select(Month, NumThefts) %>%
mutate(Month = yearmonth(Month)) %>%
filter(year(Month) < 2023) %>%
as_tsibble(index = Month)
head(chicago_crime_monthly)
## # A tsibble: 6 x 2 [1M]
## Month NumThefts
## <mth> <int>
## 1 2001 Jan 7867
## 2 2001 Feb 6669
## 3 2001 Mar 7765
## 4 2001 Apr 7686
## 5 2001 May 8420
## 6 2001 Jun 8612
chicago_crime_monthly %>%
autoplot(NumThefts)
chicago_crime_monthly %>%
ACF(NumThefts) %>%
autoplot()
chicago_crime_monthly %>%
gg_lag(NumThefts, lags = 1:12)
chicago_crime_monthly %>%
gg_season(NumThefts)
chicago_crime_monthly %>%
gg_subseries(NumThefts)
Note: I am working on getting the population of Chicago instead, but this is more complicated to mine from and will be a stretch goal.
chicago_population_ts %>%
autoplot(total_pop)
thefts_ts %>%
autoplot(TheftsPer1000)
thefts_ts %>%
gg_lag(TheftsPer1000, lags = 1:12)
thefts_ts %>%
gg_season(TheftsPer1000)
Initially, I will test the four simple models that we’ve covered in class: naive, seasonal naive, mean, and random walk with drift. There appears to be a stronger seasonal component than trend component, so my guess is that seasonal naive will probably perform the best, but I will test all 4 models with a test set and predicted set.
thefts_train <- thefts_ts %>%
filter(year(Month) <= 2017)
raw_count.model <- thefts_train %>%
model(
naive = NAIVE(NumThefts),
snaive = SNAIVE(NumThefts),
mean = MEAN(NumThefts),
lm = RW(NumThefts ~ drift()),
)
raw_count.model.forecast <- raw_count.model %>%
forecast(h = "2 years")
raw_count.model.forecast %>%
autoplot(filter(thefts_ts, year(Month) >= 2015))
raw_count.model.forecast %>%
autoplot(thefts_ts)
raw_count.model.forecast %>%
accuracy(thefts_ts)
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm Test -15.2 657. 558. -1.90 10.9 1.46 1.37 0.656
## 2 mean Test -896. 1108. 922. -18.7 19.1 2.42 2.32 0.654
## 3 naive Test -73.1 656. 540. -2.99 10.7 1.41 1.37 0.654
## 4 snaive Test -41.5 247. 201. -1.05 3.92 0.526 0.517 0.131
percapita.model <- thefts_train %>%
model(
naive = NAIVE(TheftsPer1000),
snaive = SNAIVE(TheftsPer1000),
mean = MEAN(TheftsPer1000),
lm = RW(TheftsPer1000 ~ drift()),
)
percapita.model.forecast <- percapita.model %>%
forecast(h = "2 years")
percapita.model.forecast %>%
autoplot(thefts_ts)
raw_count.model.forecast %>%
accuracy(thefts_ts)
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm Test -15.2 657. 558. -1.90 10.9 1.46 1.37 0.656
## 2 mean Test -896. 1108. 922. -18.7 19.1 2.42 2.32 0.654
## 3 naive Test -73.1 656. 540. -2.99 10.7 1.41 1.37 0.654
## 4 snaive Test -41.5 247. 201. -1.05 3.92 0.526 0.517 0.131
percapita.model.forecast %>%
accuracy(thefts_ts)
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm Test 0.00620 0.127 0.109 -0.992 11.0 1.50 1.40 0.656
## 2 mean Test -0.155 0.200 0.163 -16.9 17.5 2.24 2.21 0.654
## 3 naive Test -0.00500 0.126 0.106 -2.08 10.7 1.45 1.39 0.654
## 4 snaive Test 0.00106 0.0458 0.0371 -0.157 3.73 0.509 0.506 0.0919
As I expected, the best performing model appears to be the seasonal naive model, across all measures. Now, I will look at the residuals to see whether they look like white noise.
raw_count.snaive <- thefts_train %>%
model(
snaive = SNAIVE(TheftsPer1000)
)
raw_count.snaive %>%
gg_tsresiduals()
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).
There is still significant autocorrelation in these residuals, and the distribution looks slightly left skewed. Therefore, the seasonal naive method is not appropriate for predicting the number of thefts over time.
unemployment <- read_csv("data/unemployment_long.csv")
## Rows: 288 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Month
## dbl (1): value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
unemployment_ts <- unemployment %>%
mutate(Month = yearmonth(Month)) %>%
as_tsibble(index = Month)
unemployment_ts %>%
autoplot(value)
## Warning: Removed 10 rows containing missing values (`geom_line()`).
unemployment_ts %>%
gg_lag(value, lags = 1:12)
## Warning: Removed 10 rows containing missing values (gg_lag).
unemployment_ts %>%
gg_season()
## Plot variable not specified, automatically selected `y = value`
## Warning: Removed 10 rows containing missing values (`geom_line()`).
unemployment_ts %>%
gg_subseries(value)
## Warning: Removed 1 row containing missing values (`geom_line()`).
fc <- unemployment_ts %>%
stretch_tsibble(.init = 3, .step = 1) %>%
filter(.id < max(.id)) %>%
model(
snaive = SNAIVE(value),
lm = RW(value ~ drift()),
mean = MEAN(value),
naive = NAIVE(value),
) %>% forecast(h = 3) |>
group_by(.id) |>
mutate(h = row_number()) |>
ungroup() |>
as_fable(response = "value", distribution = value)
unemployment_ts |>
model(RW(value ~ drift())) |>
gg_tsresiduals()
## Warning: Removed 11 rows containing missing values (`geom_line()`).
## Warning: Removed 11 rows containing missing values (`geom_point()`).
## Warning: Removed 11 rows containing non-finite values (`stat_bin()`).
chicago_population <- read_csv("data/Chicago_Population_Counts.csv") %>%
filter(`Geography Type` == "Citywide")
## Rows: 235 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Geography Type, Geography, Record ID
## dbl (23): Year, Population - Total, Population - Age 0-17, Population - Age ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chicago_pop_ts <- chicago_population %>%
as_tsibble(index = Year, key = c(`Geography Type`, `Geography`, `Record ID`))
# Data from the Chicago.gov website related to population
chicago_pop_ts %>%
ggplot(aes(x = Year, y = `Population - Total`)) + geom_line() + geom_point()
chicago_crime_with_chicago_pop <- chicago_crime_monthly %>%
mutate(Year = year(Month))
chicago_crime_with_chicago_pop <- chicago_crime_with_chicago_pop %>%
filter(Year >= 2018) %>%
inner_join(chicago_pop_ts, by = "Year") %>%
mutate(TheftsPerCapitaChicago = NumThefts / `Population - Total` * 1000)
unique(chicago_crime_with_chicago_pop$`Geography Type`)
## [1] "Citywide"
unique(chicago_crime_with_chicago_pop$Geography)
## [1] "Chicago"
chicago_crime_with_chicago_pop %>%
autoplot(TheftsPerCapitaChicago)
chicago_crime_with_chicago_pop %>%
autoplot(NumThefts)